import pandas as pd
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
X = iris.data  # with as_frame=True this is already a DataFrame with feature-name columns
y = pd.DataFrame({'target': iris.target_names[iris.target] == 'virginica'})
display(X.head(), y.head())
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| | target |
|---|---|
| 0 | False |
| 1 | False |
| 2 | False |
| 3 | False |
| 4 | False |
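Before modeling, a quick class-balance check on the binary target is useful; a small sketch, rebuilding the same `y` as above:

```python
import pandas as pd
from sklearn import datasets

# Rebuild the same binary target (virginica vs. the rest) and count classes.
iris = datasets.load_iris(as_frame=True)
y = pd.DataFrame({"target": iris.target_names[iris.target] == "virginica"})
counts = y["target"].value_counts()
print(counts)  # False: 100, True: 50
```

The target is imbalanced 2:1, which is worth keeping in mind when reading accuracy numbers later.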
df = X.copy()
df["target"] = y["target"]
df.groupby("target").describe()
2 rows × 32 columns; the notebook output elides the middle columns. The fully visible blocks are:

sepal length (cm):

| target | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 100.0 | 5.471 | 0.641698 | 4.3 | 5.000 | 5.4 | 5.9 | 7.0 |
| True | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 |

petal width (cm):

| target | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 100.0 | 0.786 | 0.565153 | 0.1 | 0.2 | 0.8 | 1.3 | 1.8 |
| True | 50.0 | 2.026 | 0.274650 | 1.4 | 1.8 | 2.0 | 2.3 | 2.5 |

Partially visible: sepal width mean 3.099 (False) / 2.974 (True); petal length 75%/max 4.325/5.1 (False) and 5.875/6.9 (True).
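The wide `describe()` output is hard to read; a sketch that narrows it to the two most informative features (assuming the same `df` with its boolean `target` column):

```python
import pandas as pd
from sklearn import datasets

# Rebuild df so the sketch is self-contained.
iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df["target"] = iris.target_names[iris.target] == "virginica"

# Per-class mean and spread for the two petal features only.
summary = df.groupby("target")[["petal width (cm)", "petal length (cm)"]].agg(["mean", "std"])
print(summary)
```

The per-class petal-width means (0.786 vs. 2.026) already suggest this single feature separates the classes well.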
import seaborn as sns
sns.histplot(data = df, x = "sepal width (cm)", hue="target")
sns.histplot(data = df, x = "sepal length (cm)", hue="target")
sns.histplot(data = df, x = "petal width (cm)", hue="target")
sns.histplot(data = df, x = "petal length (cm)", hue="target")
corr = df.corr()
sns.heatmap(corr, annot=True)
import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()
fig_boxplot = px.box(df, x="target", y="sepal width (cm)", title="Box Plot of Sepal width (cm) by target",
color='target'  # set box color based on target
)
# Update layout
fig_boxplot.update_layout(xaxis_title="target", yaxis_title="Sepal width (cm)",)
# Show the box plot
fig_boxplot.show()
fig_boxplot = px.box(df, x="target", y="sepal length (cm)", title="Box Plot of Sepal length (cm) by target",
color='target'  # set box color based on target
)
# Update layout
fig_boxplot.update_layout(xaxis_title="target", yaxis_title="Sepal length (cm)",)
# Show the box plot
fig_boxplot.show()
fig_boxplot = px.box(df, x="target", y="petal width (cm)", title="Box Plot of petal width (cm) by target",
color='target'  # set box color based on target
)
# Update layout
fig_boxplot.update_layout(xaxis_title="target", yaxis_title="petal width (cm)",)
# Show the box plot
fig_boxplot.show()
fig_boxplot = px.box(df, x="target", y="petal length (cm)", title="Box Plot of petal length (cm) by target",
color='target'  # set box color based on target
)
# Update layout
fig_boxplot.update_layout(xaxis_title="target", yaxis_title="petal length (cm)",)
# Show the box plot
fig_boxplot.show()
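The four box-plot cells above repeat the same pattern; as a compact alternative, a loop sketch using seaborn (already imported above), assuming the same `df`:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Rebuild df so the sketch is self-contained.
iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df["target"] = iris.target_names[iris.target] == "virginica"

features = ["sepal width (cm)", "sepal length (cm)",
            "petal width (cm)", "petal length (cm)"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.ravel(), features):
    sns.boxplot(data=df, x="target", y=feature, ax=ax)  # one box per class
    ax.set_title(f"{feature} by target")
fig.tight_layout()
```

Looping over the feature list keeps the four plots in sync if the titles or styling change later.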
The meaning of this graph: each box summarizes the distribution of one feature (median, quartiles, outliers), split by whether the sample is virginica.
What we can learn from it: the petal measurements separate the two classes almost completely, while the sepal measurements overlap; sepal width in particular barely differs between the classes.
plotly.offline.init_notebook_mode()
fig = px.scatter_matrix(df, dimensions=["sepal width (cm)", "sepal length (cm)", "petal width (cm)", "petal length (cm)"], color="target")
fig.update_layout(width=1200, height=800)
fig.show()
The meaning of this graph: the scatter matrix plots every pair of features against each other, colored by class.
What can we learn from it: virginica forms a fairly distinct cluster in every panel that involves petal width or petal length, so a linear decision boundary on those features should work well.
sns.pairplot(data=df, hue="target")
The meaning of this graph: seaborn's pairplot shows the same pairwise scatter plots, with per-class feature distributions on the diagonal.
What can we learn from it: it confirms the scatter matrix: the class distributions of the petal features barely overlap, while the sepal features alone cannot separate the classes.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # no random_state, so the split (and the numbers below) varies between runs
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)  # split the held-out 20% into equal validation and test sets
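A sketch of the same 80/10/10 split with a fixed `random_state` (an added assumption, purely for reproducibility) and a size check:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = pd.DataFrame({"target": iris.target_names[iris.target] == "virginica"})

# 80% train, then split the remaining 20% in half: 10% validation, 10% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 120 15 15
```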
from sklearn.linear_model import LogisticRegression
I chose the features in decreasing order of their correlation with the target (see the correlation matrix above):
petal width → petal length → sepal length → sepal width
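This ordering can be checked directly by correlating each feature with the (0/1-encoded) target; a self-contained sketch:

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df["target"] = (iris.target_names[iris.target] == "virginica").astype(int)

# Absolute correlation with the target, strongest feature first.
ranking = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

The ranking matches the order used below: petal width, petal length, sepal length, sepal width.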
log_reg1 = LogisticRegression()
log_reg1.fit(X_train[['petal width (cm)']], y_train['target'])
log_reg2 = LogisticRegression()
log_reg2.fit(X_train[['petal width (cm)', 'petal length (cm)']], y_train['target'])
log_reg3 = LogisticRegression()
log_reg3.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']], y_train['target'])
log_reg4 = LogisticRegression()
log_reg4.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']], y_train['target'])
pred1 = log_reg1.predict(X_val[['petal width (cm)']])
prob1 = log_reg1.predict_proba(X_val[['petal width (cm)']])
if isinstance(y_val, pd.DataFrame):
y_val = y_val.squeeze()
results_df1 = pd.DataFrame({
'probability of predicting virginica': prob1[:, 1],
'actual prediction by the model': pred1,
'ground truth': y_val
})
results_df1
| | probability of predicting virginica | actual prediction by the model | ground truth |
|---|---|---|---|
| 41 | 0.004366 | False | False |
| 138 | 0.625598 | True | True |
| 36 | 0.002942 | False | False |
| 97 | 0.187305 | False | False |
| 12 | 0.001981 | False | False |
| 54 | 0.337329 | False | False |
| 88 | 0.187305 | False | False |
| 126 | 0.625598 | True | True |
| 7 | 0.002942 | False | False |
| 9 | 0.001981 | False | False |
| 71 | 0.187305 | False | False |
| 145 | 0.923747 | True | True |
| 58 | 0.187305 | False | False |
| 137 | 0.625598 | True | True |
| 61 | 0.337329 | False | False |
pred2 = log_reg2.predict(X_val[['petal width (cm)', 'petal length (cm)']])
prob2 = log_reg2.predict_proba(X_val[['petal width (cm)', 'petal length (cm)']])
if isinstance(y_val, pd.DataFrame):
y_val = y_val.squeeze()
results_df2 = pd.DataFrame({
'probability of predicting virginica': prob2[:, 1],
'actual prediction by the model': pred2,
'ground truth': y_val
})
results_df2
| | probability of predicting virginica | actual prediction by the model | ground truth |
|---|---|---|---|
| 41 | 0.000003 | False | False |
| 138 | 0.473268 | False | True |
| 36 | 0.000003 | False | False |
| 97 | 0.076144 | False | False |
| 12 | 0.000003 | False | False |
| 54 | 0.218866 | False | False |
| 88 | 0.045997 | False | False |
| 126 | 0.473268 | False | True |
| 7 | 0.000005 | False | False |
| 9 | 0.000004 | False | False |
| 71 | 0.035565 | False | False |
| 145 | 0.882237 | True | True |
| 58 | 0.155556 | False | False |
| 137 | 0.854403 | True | True |
| 61 | 0.087494 | False | False |
pred3 = log_reg3.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])
prob3 = log_reg3.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])
if isinstance(y_val, pd.DataFrame):
y_val = y_val.squeeze()
results_df3 = pd.DataFrame({
'probability of predicting virginica': prob3[:, 1],
'actual prediction by the model': pred3,
'ground truth': y_val
})
results_df3
| | probability of predicting virginica | actual prediction by the model | ground truth |
|---|---|---|---|
| 41 | 0.000004 | False | False |
| 138 | 0.493186 | False | True |
| 36 | 0.000002 | False | False |
| 97 | 0.070013 | False | False |
| 12 | 0.000003 | False | False |
| 54 | 0.189062 | False | False |
| 88 | 0.053410 | False | False |
| 126 | 0.469783 | False | True |
| 7 | 0.000005 | False | False |
| 9 | 0.000004 | False | False |
| 71 | 0.032481 | False | False |
| 145 | 0.861194 | True | True |
| 58 | 0.127914 | False | False |
| 137 | 0.855575 | True | True |
| 61 | 0.089958 | False | False |
pred4 = log_reg4.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])
prob4 = log_reg4.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])
if isinstance(y_val, pd.DataFrame):
y_val = y_val.squeeze()
results_df4 = pd.DataFrame({
'probability of predicting virginica': prob4[:, 1],
'actual prediction by the model': pred4,
'ground truth': y_val
})
results_df4
| | probability of predicting virginica | actual prediction by the model | ground truth |
|---|---|---|---|
| 41 | 0.000004 | False | False |
| 138 | 0.472638 | False | True |
| 36 | 0.000001 | False | False |
| 97 | 0.067502 | False | False |
| 12 | 0.000002 | False | False |
| 54 | 0.196136 | False | False |
| 88 | 0.046781 | False | False |
| 126 | 0.478440 | False | True |
| 7 | 0.000003 | False | False |
| 9 | 0.000003 | False | False |
| 71 | 0.032597 | False | False |
| 145 | 0.861440 | True | True |
| 58 | 0.127190 | False | False |
| 137 | 0.842569 | True | True |
| 61 | 0.081819 | False | False |
Accuracy = (number of correct predictions) / (total number of predictions)
from sklearn.metrics import accuracy_score
accuracy1 = accuracy_score(y_val, pred1)
print("Accuracy of the logistic regression model1 is: ", accuracy1*100, "%")
accuracy2 = accuracy_score(y_val, pred2)
print("Accuracy of the logistic regression model2 is: ", accuracy2*100, "%")
accuracy3 = accuracy_score(y_val, pred3)
print("Accuracy of the logistic regression model3 is: ", accuracy3*100, "%")
accuracy4 = accuracy_score(y_val, pred4)
print("Accuracy of the logistic regression model4 is: ", accuracy4*100, "%")
Accuracy of the logistic regression model1 is: 100.0 %
Accuracy of the logistic regression model2 is: 86.66666666666667 %
Accuracy of the logistic regression model3 is: 86.66666666666667 %
Accuracy of the logistic regression model4 is: 86.66666666666667 %
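Accuracy alone can mislead on this imbalanced target (100 non-virginica vs. 50 virginica); a sketch adding a confusion matrix and precision/recall for the one-feature model, refit on a reproducible split (`random_state` is an added assumption):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == "virginica"

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train[["petal width (cm)"]], y_train)
pred = model.predict(X_val[["petal width (cm)"]])

cm = confusion_matrix(y_val, pred)  # rows: true class, columns: predicted class
print(cm)
print("precision:", precision_score(y_val, pred))
print("recall:", recall_score(y_val, pred))
```

Precision and recall show separately how often predicted virginicas are correct and how many true virginicas are caught, which a single accuracy number hides.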
import matplotlib.pyplot as plt
import numpy as np
X_val_feature = X_val['petal width (cm)']
# With one feature the boundary is where w*x + b = 0, i.e. x = -b/w.
# axvline expects a scalar, hence the [0] to unwrap the 1-element arrays.
decision_boundary = (-log_reg1.intercept_ / log_reg1.coef_[0])[0]
plt.scatter(X_val_feature, y_val, c=y_val, edgecolor='b')
plt.axvline(x=decision_boundary, color='green')
plt.xlabel('Petal Width (cm)')
plt.show()
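The boundary value can be sanity-checked analytically: at x = -b/w the model should output a probability of exactly 0.5. A self-contained sketch (refit on the full data, whereas the notebook's model was fit on a random training split):

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris(as_frame=True)
x = iris.data[["petal width (cm)"]]
y = iris.target_names[iris.target] == "virginica"
model = LogisticRegression().fit(x, y)

# With one feature the boundary solves w*x + b = 0, i.e. x = -b/w;
# at that point sigmoid(0) = 0.5, so predict_proba should return 0.5.
w, b = model.coef_[0][0], model.intercept_[0]
threshold = -b / w
p = model.predict_proba(pd.DataFrame({"petal width (cm)": [threshold]}))[0, 1]
print(threshold, p)
```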
decision_boundary_x1 = np.linspace(X_val['petal width (cm)'].min(), X_val['petal width (cm)'].max(), 10)
# Solve w1*x1 + w2*x2 + b = 0 for x2 to get the boundary line.
decision_boundary_x2 = -log_reg2.intercept_[0] / log_reg2.coef_[0][1] - log_reg2.coef_[0][0] / log_reg2.coef_[0][1] * decision_boundary_x1
plt.scatter(X_val['petal width (cm)'], X_val['petal length (cm)'], c=y_val, edgecolor='b')
plt.plot(decision_boundary_x1, decision_boundary_x2)
plt.xlabel('Petal Width (cm)')
plt.ylabel('Petal Length (cm)')
plt.show()
y_val_numeric = y_val.astype(int)
import plotly.graph_objs as go
import numpy as np
plotly.offline.init_notebook_mode()
# Pad the plotting range by 1 on each side.
min_feature_value1 = X_val['petal width (cm)'].min() - 1
max_feature_value1 = X_val['petal width (cm)'].max() + 1
min_feature_value2 = X_val['petal length (cm)'].min() - 1
max_feature_value2 = X_val['petal length (cm)'].max() + 1
x1, x2 = np.meshgrid(np.linspace(min_feature_value1, max_feature_value1, 100),
np.linspace(min_feature_value2, max_feature_value2, 100))
# The decision boundary is the plane w1*x1 + w2*x2 + w3*x3 + b = 0, solved for x3.
x3 = (-log_reg3.intercept_ - log_reg3.coef_[0][0] * x1 - log_reg3.coef_[0][1] * x2) / log_reg3.coef_[0][2]
decision_boundary_surface = go.Surface(x=x1, y=x2, z=x3, colorscale='Viridis', opacity=0.5)
data_scatter = go.Scatter3d(x=X_val['petal width (cm)'], y=X_val['petal length (cm)'], z=X_val['sepal length (cm)'],
mode='markers',
marker=dict(size=5, color=y_val_numeric, colorscale='Bluered', opacity=0.8))
layout = go.Layout(title='3D plot with decision boundary',
scene=dict(xaxis_title='petal width (cm)',
yaxis_title='petal length (cm)',
zaxis_title='sepal length (cm)'),
margin=dict(l=0, r=0, b=0, t=0))
fig = go.Figure(data=[decision_boundary_surface, data_scatter], layout=layout)
fig.show()
According to the predict / predict_proba tables above, instances 138 and 126 were misclassified by the two-feature, three-feature, and four-feature models.
There is a clear failure pattern: in the plots above, both instances lie close to the decision boundary in every model.
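The misclassified rows can also be pulled out programmatically instead of read off the tables; a sketch refitting the two-feature model on a reproducible split (`random_state` is an added assumption, so the flagged rows may differ from the notebook's):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == "virginica"

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
features = ["petal width (cm)", "petal length (cm)"]
model = LogisticRegression().fit(X_train[features], y_train)

pred = model.predict(X_val[features])
misclassified = X_val[pred != y_val]  # rows where the model disagrees with ground truth
print(misclassified)
```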
X_val.iloc[1], X_val.iloc[7]
(sepal length (cm)    6.0
 sepal width (cm)     3.0
 petal length (cm)    4.8
 petal width (cm)     1.8
 Name: 138, dtype: float64,
 sepal length (cm)    6.2
 sepal width (cm)     2.8
 petal length (cm)    4.8
 petal width (cm)     1.8
 Name: 126, dtype: float64)
I recommend the first logistic regression model (one feature: petal width) as the best model: it achieved the highest validation accuracy with the fewest features.
pred_test = log_reg1.predict(X_test[['petal width (cm)']])
prob_test = log_reg1.predict_proba(X_test[['petal width (cm)']])
if isinstance(y_test, pd.DataFrame):
y_test = y_test.squeeze()
results_df_test = pd.DataFrame({
'probability of predicting virginica': prob_test[:, 1],
'actual prediction by the model': pred_test,
'ground truth': y_test
})
results_df_test
| | probability of predicting virginica | actual prediction by the model | ground truth |
|---|---|---|---|
| 22 | 0.002942 | False | False |
| 84 | 0.337329 | False | False |
| 100 | 0.963972 | True | True |
| 3 | 0.002942 | False | False |
| 57 | 0.065607 | False | False |
| 118 | 0.923747 | True | True |
| 14 | 0.002942 | False | False |
| 128 | 0.845793 | True | True |
| 123 | 0.625598 | True | True |
| 5 | 0.006474 | False | False |
| 127 | 0.625598 | True | True |
| 33 | 0.002942 | False | False |
| 21 | 0.006474 | False | False |
| 120 | 0.923747 | True | True |
| 42 | 0.002942 | False | False |
accuracy_test = accuracy_score(y_test, pred_test)
print("Accuracy of the logistic regression model1 on test is: ", accuracy_test*100, "%")
Accuracy of the logistic regression model1 on test is: 100.0 %